Satisfiability, sequence niches, and molecular codes in cellular signaling 
Biological information processing as implemented by regulatory and signaling networks in living
cells requires sufficient specificity of molecular interaction to distinguish signals from one another, 
but much of regulation and signaling involves somewhat fuzzy and promiscuous recognition of molec- 
ular sequences and structures, which can leave systems vulnerable to crosstalk. This paper examines 
a simple computational model of protein-protein interactions which reveals both a sharp onset of 
crosstalk and a fragmentation of the neutral network of viable solutions as more proteins compete 
for regions of sequence space, revealing intrinsic limits to reliable signaling in the face of promiscuity. 
These results suggest connections to both phase transitions in constraint satisfaction problems and 
coding theory bounds on the size of communication codes. 



1. INTRODUCTION



The functioning of complex biomolecular pathways 
hinges on conveying molecular signals reliably in the 
stochastic and evolving milieu of living cells. These sig- 
nals are mediated by molecular interactions that distin- 
guish physiological binding partners from myriad other 
cellular constituents: this ability to distinguish func- 
tional signals from the molecular noise is ultimately the 
source of information processing in cellular networks. 
But molecular recognition is subtle: many of the molec- 
ular interactions involved in cellular regulatory and sig- 
naling pathways do not involve highly specific "lock and 
key" binding, but instead are characterized by more fuzzy 
and promiscuous recognition of families of sequences and 
configurations [1, 2, 3]. Furthermore, there are often
multiple types of molecules within a cell that can bind 
to the same target, such as different proteins contain- 
ing homologous copies of a modular interaction domain. 
We therefore ask a basic theoretical question concern- 
ing cellular signaling in crowded sequence spaces, where 
multiple proteins bind to similar families of molecular 
sequences and structures: under what circumstances can 
crosstalk be avoided in such a system? This paper in- 
vestigates a simple null model, associated with random 
molecular sequences, that is amenable to analysis and 
suggests connections to recent work on phase transitions 
in combinatorial NP-complete problems. This random 
model is not directly applicable to the evolved molecu- 
lar sequences found in nature, but serves as a useful first 
step in defining the landscape of constraint satisfaction 
in cellular signaling. 

The theory of communication in noisy channels, dating
back to the seminal work of Shannon [4, 5], also
provides a useful framework in which to interpret cel- 
lular signals. Engineered error-correcting codes embed 
messages in higher-dimensional spaces (e.g., via encoded 
checks on the message integrity), to insulate each pos- 
sible codeword within a sphere in the embedding space. 
By packing such spheres so that they are disjoint, any 
corrupted message word can (up to some defined number
of errors) be uniquely associated with an original
code word. In molecular signaling, sequence recognition 
volumes play a similar role: these volumes describe the 
sets of sequences recognized (i.e., bound with significant 
probability) by various target molecules. In molecular 
signaling, overlapping recognition of sequences precludes 
the sort of disjoint geometries found in engineered codes. 
Instead of asking, therefore, whether all messages can 
be communicated through a protein signaling channel, 
we will focus here instead on whether any message can 
be so conveyed (under the assumption that evolutionary 
selection might find such a solution if it does in princi- 
ple exist). Addressing the discrimination of potentially 
ambiguous signals, this work is related to issues arising 
in error-correcting codes, but geometrically it is in some 
ways more similar to problems involving covering codes 
[6] and identifying codes [7]. A central result presented
here, which establishes limits on the number of proteins 
that can compete for regions in sequence space before 
crosstalk becomes likely, is akin to a bound on the size 
of a code in a communication system. 

This work was motivated in part by experiments on 
SH3-mediated signaling in yeast (Saccharomyces cerevisiae)
by Zarrinpar, Park and Lim [8]. SH3 domains
constitute a family of conserved modular protein do- 
mains, known to bind to a set of proline-rich peptide 
sequences (the so-called "PXXP" motif, which actually
consists of a larger peptide of approximately 8-10
residues) [9, 10]. Because of this interaction promiscuity,
and because several proteins in yeast contain
SH3 domains, it was not obvious whether there would 
be crosstalk among pathways involving different SH3- 
containing proteins. Zarrinpar et al. probed the yeast 
high-osmolarity signaling pathway, which involves the interaction
of Sho1 (a protein with an SH3 domain) and
Pbs2 (containing a PXXP motif). By making chimeric
versions of Sho1 containing different SH3 domains, they
demonstrated that none of the other native yeast SH3 
domains were capable of interacting with Pbs2, but that 
half of the metazoan SH3 domains they tested were able 
to do so. They surmised that there has been an evolu- 
tionary selection against crosstalk with that pathway in 
yeast, with protein sequences having co-evolved such that 
the Pbs2 ligand lies in a niche in sequence space where it 
is recognized by only the Sho1 SH3 domain. Since there
has been no such selection pressure to avoid crosstalk 
in other organisms, the Pbs2 motif bound to non-native 
SH3 domains with greater probability. (See supplementary
text and Figure S.1 for further discussion.) It is the
structure of these sorts of sequence niches that forms the
core of this paper. In related work, Sear has computed 
the capability of a set of competitive protein-protein in- 
teractions, and examined crosstalk avoidance in a model 
motivated by the same set of yeast signaling experiments
[11].

The fundamental questions posed by the experiments 
on SH3 signaling in yeast extend beyond that particular 
system. A classic problem in immunology is the ability of 
antibodies to discriminate between "self" and "nonself" 
antigens, and early work addressed how large a recogni- 
tion region needs to be in order to reliably perform this 
discrimination [12]. In gene regulation, transcription factors
(TFs) that regulate gene expression by binding to
DNA are organized in families that often recognize simi- 
lar sorts of sequences. Recent work in that area has ex- 
plored tradeoffs between binding specificity and system 
robustness [13], balances between selection and mutation
[14], evolutionary divergence of competing TF-binding
sequence pairs to avoid crosstalk [15], and the application
of ideas from coding theory to understand limits on
the size of TF families [16]. In bacterial signaling, the 
possibility of crosstalk among two-component regulatory 
systems, whereby multiple response regulators are acti- 
vated by a single sensor kinase, has also been explored to 
gain insight into how environmental signals are combined
[17, 18, 19].


2. RESULTS 

2.1. The Sequence Niche Question 

We begin by distilling the central question to be 
considered here: under what conditions does a unique 
sequence niche exist so that signaling without crosstalk 
might be possible? To address this question, we adopt a 
highly abstracted model of protein-protein interaction, 
in which protein sequences are represented by binary 
strings of length L (consisting of 0's and 1's) rather
than as peptide strings in the 20-letter amino acid 
alphabet. (Binary sequence models, such as the HP
model, have been used in the study of protein folding
[20], although it remains an open question as to whether
there is an appropriate coarse-grained alphabet capable 
of capturing the essential biochemistry of protein-protein 
interactions involved in signaling.) In this model, bind- 
ing of a peptide sequence to a protein is achieved if the 
sequence is sufficiently close to the consensus sequence 
recognized by the protein, with Hamming distance used 
as a measure of closeness: two sequences bind if they 
differ in at most R positions, given some promiscuity 
radius R. Given this representation, we can pose the 




FIG. 1: The Sequence Niche Question: given a target protein 
sequence T and a set of N crosstalking protein sequences {C},
is there a sequence s that is bound by T but not by any of
the proteins C_i? In this model, sequences are binary strings
of length L, and two sequences bind if the Hamming distance 
between them is less than or equal to R. 



Sequence Niche Question, phrased and typeset in the 
canonical style of Garey and Johnson [21] and illustrated
schematically in Fig. 1:

SEQUENCE NICHE 

INSTANCE: Binary sequence T of length L, a set of
binary crosstalk sequences C_i, for i = 1, ..., N, each of
length L, and an integer R, 0 ≤ R ≤ L.
QUESTION: Is there a binary sequence s of length L
such that H(T, s) ≤ R and H(C_i, s) > R for i = 1, ..., N,
where H(x, y) is the Hamming distance between
sequences x and y?
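As an illustration of the decision problem stated above, the following minimal Python sketch decides small instances by exhaustive search over the target ball. It is not the solution algorithm used for the results reported here; the integer-bitmask encoding of sequences and the names `hamming`, `ball`, and `find_niche` are assumptions made for this example.

```python
from itertools import combinations


def hamming(x, y):
    """Hamming distance between two sequences stored as integer bitmasks."""
    return bin(x ^ y).count("1")


def ball(center, L, R):
    """Yield every length-L binary sequence within Hamming distance R of center."""
    for n in range(R + 1):
        for positions in combinations(range(L), n):
            s = center
            for p in positions:
                s ^= 1 << p  # flip bit p
            yield s


def find_niche(T, crosstalk, L, R):
    """Return a sequence bound by T and by no crosstalk protein, or None."""
    for s in ball(T, L, R):
        # s is within distance R of T by construction; it must lie
        # farther than R from every crosstalk sequence C_i.
        if all(hamming(c, s) > R for c in crosstalk):
            return s
    return None
```

For instance, with L = 4 and R = 1, `find_niche(0b0000, [0b0001], 4, 1)` returns `0b0010`, while `find_niche(0b00, [0b00], 2, 1)` returns `None` because the single crosstalk ball coincides with the target ball. The cost grows with V(L, R), so this approach is only practical for short sequences.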

The Sequence Niche Question (SNQ) is a rephrasing of 
the Distinguishing String Selection Problem (DSSP), as 
defined by Lanctot et al. [22]. (The DSSP allows more
generally for multiple "good" strings S_c to be matched within some
Hamming distance k_c and multiple "bad" strings S_f to
be avoided outside some Hamming distance k_f.) The
DSSP was proven to be NP-complete [22]; the SNQ is the
DSSP with |S_c| = 1 and k_c = k_f, but the computational
complexity of the DSSP does not depend on the values
of these parameters, so the SNQ is also NP-complete.
The SNQ is similar in spirit to the well-known computer 
science problem SAT (and its specialization k-SAT), in
that these problems ask whether there exists a solution
that satisfies a set of (potentially conflicting) constraints
[21]. Borrowing from the language of SAT, we say a
particular instance of the SNQ is "satisfiable" when a so- 
lution s exists, and "unsatisfiable" when there is no such 
solution. The SNQ asks whether discrimination of one 
target protein from a background of crosstalking proteins 
is possible. A symmetric generalization of this problem 
would ascertain whether every protein in a collection is 
distinguishable, that is, whether there is a separate sequence
niche for each of the N proteins. This generalized
SNQ is essentially that considered by Sear [11],

FIG. 2: (a) Average fraction of unsatisfiable instances of the random SNQ as a function of L, R, and N ((L, R) specified in
figure legend, N varying along x-axis). (b) Average run time r of the SNQ decision (number of recursive calls in the solution
algorithm) for the same instances depicted in (a). Averages are over 100 instances of the SNQ for each (L, R, N) set.



although he did so for a 4-letter protein alphabet and a 
more realistic treatment of the binding kinetics than sim- 
ple Hamming distances, demonstrating that for at least 
some parameters, such discrimination is possible. The 
generalized SNQ is presumably in the same complexity 
class as the single-target SNQ, since deciding it simply 
involves deciding N separate SNQs. 

2.2. Satisfiability of the random SNQ 

The NP-completeness of the SNQ is a statement about 
its worst-case complexity, but there has been increas- 
ing interest in recent years in quantifying the typical- 
case complexity of NP-hard problems. A common strat- 
egy is to examine ensembles of random instances of NP- 
hard problems, investigating how solution complexity de- 
pends upon parameters that characterize those random 
instances. A similar strategy is adopted here. 

Multiple random instances of the SNQ were examined 
(with equal probability of 0's and 1's in the sequence
strings), for various values of the problem parameters
L, R, and N. Figure 2(a) shows the average unsatisfiable
fraction of random SNQ instances as a function
of the number of crosstalking proteins N, averaged over 
an ensemble of 100 random instances for each N. In addition,
Figure 2(b) shows the average run time r required
for determining whether or not an instance is satisfiable
(where run time is measured in units of the number of
recursive calls to the solution algorithm of Gramm et al.
[23]). Fig. 2(a) demonstrates a transition from satisfiability
(SAT) to unsatisfiability (UNSAT) as the number
of crosstalking proteins is increased. Rather than
a gradual diminution in the capacity for reliable signaling,
the SNQ exhibits a relatively abrupt switch as log N
increases. Fig. 2(b) reveals, for the same set of parameter
values, that the run time of the solution algorithm
reaches a maximum near the point of the SAT-UNSAT 
transition. In other words, it becomes significantly more 
difficult to decide if a given instance is satisfiable or not 
when that instance lies near the transition. The charac- 
teristic scales of the random SNQ are seen to vary over 
orders of magnitude. For the solution run times, this is 
perhaps not surprising: since the SNQ is NP-complete, 
we expect the worst-case run time of the solution algo- 
rithm to be exponential in the size of the problem. 
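The ensemble experiment described in this section can be sketched for small problem sizes as follows. This is a hedged illustration only: it decides each instance by brute-force enumeration of the target ball rather than with the algorithm of Gramm et al. [23], so it reproduces the unsatisfiable fraction but not the run-time measure. The function names, the default of 100 trials, and the seed are assumptions chosen for the example.

```python
import random
from itertools import combinations


def ball(center, L, R):
    """All length-L binary sequences (as ints) within Hamming distance R of center."""
    out = []
    for n in range(R + 1):
        for positions in combinations(range(L), n):
            s = center
            for p in positions:
                s ^= 1 << p
            out.append(s)
    return out


def is_satisfiable(T, crosstalk, L, R):
    """Brute-force SNQ decision: does some s in the target ball avoid every C_i?"""
    return any(
        all(bin(c ^ s).count("1") > R for c in crosstalk)
        for s in ball(T, L, R)
    )


def unsat_fraction(L, R, N, trials=100, seed=0):
    """Fraction of random instances (uniform random bits) with no sequence niche."""
    rng = random.Random(seed)
    unsat = 0
    for _ in range(trials):
        T = rng.getrandbits(L)
        crosstalk = [rng.getrandbits(L) for _ in range(N)]
        if not is_satisfiable(T, crosstalk, L, R):
            unsat += 1
    return unsat / trials


if __name__ == "__main__":
    # Sweep N at fixed (L, R) to look for the SAT-UNSAT switch.
    for N in (1, 5, 10, 20, 40, 80):
        print(N, unsat_fraction(L=10, R=3, N=N))
```

Sweeping N at fixed (L, R) in this way gives a curve of the kind plotted in Fig. 2(a), although only for sequence lengths small enough that exhaustive enumeration is feasible.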

2.3. Scaling of the SNQ transition: a satisfiability
bound on the number of crosstalking proteins 

Even though the characteristic scales of the SNQ vary 
by orders of magnitude, there is a scaling structure ev- 
ident in those data. This structure is understood by 
considering the geometric and probabilistic nature of the 
random SNQ. A given instance is unsatisfiable if the tar- 
get volume (i.e., the Hamming sphere of radius R sur- 
rounding the target sequence T) is completely covered 
by the union of the crosstalk volumes (centered about 
the crosstalk sequences {C}), a process that is illustrated 
schematically in Fig. 3(a). We can estimate the critical
number of crosstalk proteins N_c needed to cover the sequence
volume of the target protein (see supplementary
text for full derivation):

$$ N_c = \frac{\log(1/V)}{\log(1 - V/V_0)} \tag{1} $$


where $V_0(L) = 2^L$ is the total number of possible binary
sequences of length L, and $V(L, R) = \sum_{n=0}^{R} \binom{L}{n}$ is the
number of binary sequences in a ball of Hamming radius
R about a given sequence. We can interpret this as a 
random satisfiability bound on the approximate number 
of randomly distributed proteins that can coexist without 
crosstalk. 
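To make the bound concrete, the short Python sketch below evaluates eq. (1) directly from the definitions of V(L, R) and the total volume 2^L. The function names are assumptions introduced for this example, and the logarithm base is irrelevant because it cancels in the ratio.

```python
from math import comb, log


def ball_volume(L, R):
    """V(L, R): number of binary sequences within Hamming distance R of a sequence."""
    return sum(comb(L, n) for n in range(R + 1))


def critical_number(L, R):
    """N_c = log(1/V) / log(1 - V/V0), with V0 = 2**L, as in eq. (1)."""
    V = ball_volume(L, R)
    V0 = 2 ** L
    if V >= V0:
        raise ValueError("ball fills the whole space; log(1 - V/V0) is undefined")
    return log(1 / V) / log(1 - V / V0)


if __name__ == "__main__":
    print(ball_volume(16, 6), critical_number(16, 6))
```

For L = 16 and R = 6, the parameters used for the neutral network examples in this paper, the target ball contains 14893 sequences and the estimate gives N_c of roughly 37.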

With this critical protein number, we can rescale the 
raw satisfiability and run time data of Fig. 2. These
rescaled data are shown in Fig. 3: in (b) and (c) the
protein number (x-axis) is scaled as N → (N - N_c)/N_c,
and in (c), the run time data (y-axis) are scaled by the
exponentially growing number of sequences in the search
tree V(L, R) that in principle need to be considered. The
collapse of each set of unscaled data onto a reasonably
compact scaling form suggests this simple description is 
correct. 



2.4. Fragmentation of the solution space 

Previously we considered whether there is any solution 
to a given instance of the SNQ. Here we examine the 
structure of the space of all satisfying solutions for an 
instance, as determined via exhaustive enumeration. 

Consider a fixed target sequence T and a set of po- 
tential crosstalk sequences {C}. Imagine introducing 
crosstalk sequences one at a time, and identifying the 
set of all sequences {s_N} that satisfy the SNQ for that
instance with N crosstalk sequences. Of particular interest
here are the size and structure of the solution set {s_N}
as a function of the number of proteins N. For each 
set, we assemble a graph whose nodes are sequences s 
that satisfy the SNQ and whose edges connect satisfying 
sequences if they are neighbors on the hypercube, i.e., if 
their Hamming distance from each other is 1. This graph 
represents the neutral network of all solutions to a given 
instance of the SNQ, along which single point mutations 

FIG. 3: Scaling description of the SAT-UNSAT transition in
the SNQ. (a) Schematic depiction of the covering of available
sequences (black dots) in the target volume as crosstalk proteins
(gray circles) are laid down randomly. (b, c) Scaling of
the satisfiability and run time data in Fig. 2 based on the scaling
theory presented: (b) the number of crosstalk proteins N
is scaled by N → (N - N_c)/N_c, and (c) in addition to scaling
N, the run times r are scaled by the number of sequences in
the target volume V(L, R) that must be considered.



to the solution string (bit flips) can be made without 
producing crosstalk. For various N, we can compute the 
set of connected components of the resulting graph. The 
change in the structure of the neutral network of satisfy- 
ing solutions is illustrated, for a given problem instance 
with L = 16 and R = 6, in Fig. 4. For small numbers of
proteins (Fig. 4(a)), there are many possible solutions to
the SNQ, and those solutions all coalesce into one connected
cluster, such that any solution can be reached
from any other via a succession of single-bit flips to the
solution string. As N increases (Fig. 4(b)), the number of
satisfying solutions decreases, and the connected cluster
of solutions is fragmented into many disjoint sets (still
dominated by a central core). This fragmentation and
evaporation of the sequence clusters continues for larger
N (Fig. 4(c)), until finally all solutions disappear, and
unique signaling is no longer possible. While the neutral 
networks shown reveal the effects of mutations in the so- 
lution string s, it should be noted that single point mu- 
tations in the sequences representing the centers of the 
proteins T and {C} can result in drastic changes in the 
neutral network topology, e.g., by fragmenting a single 
large cluster into a set of smaller ones. 
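The construction of the neutral network described above (enumerate all satisfying sequences, connect those at Hamming distance 1, and extract connected components) can be sketched as follows. This is a minimal illustration under stated assumptions: sequences are stored as integer bitmasks, the names `solutions` and `neutral_components` are hypothetical, and the component search is a plain depth-first traversal rather than whatever graph tooling was used to produce Fig. 4.

```python
from itertools import combinations


def solutions(T, crosstalk, L, R):
    """Set of all sequences within R of T and farther than R from every C_i."""
    sols = set()
    for n in range(R + 1):
        for positions in combinations(range(L), n):
            s = T
            for p in positions:
                s ^= 1 << p
            if all(bin(c ^ s).count("1") > R for c in crosstalk):
                sols.add(s)
    return sols


def neutral_components(sols, L):
    """Connected components of the solution set under single-bit flips."""
    unseen = set(sols)
    components = []
    while unseen:
        start = unseen.pop()
        stack, comp = [start], {start}
        while stack:
            s = stack.pop()
            for p in range(L):
                t = s ^ (1 << p)  # hypercube neighbor at Hamming distance 1
                if t in unseen:
                    unseen.remove(t)
                    comp.add(t)
                    stack.append(t)
        components.append(comp)
    return components
```

As a toy case with L = 4, R = 1, and T = 0000, the unobstructed solution set {0000, 0001, 0010, 0100, 1000} forms one component; adding the crosstalk sequence 0100 removes 0000 and 0100, leaving three isolated solutions. Recording `len(components)` and the size of the largest component as crosstalk sequences are added one at a time gives the two quantities summarized in Fig. 4(d).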

A summary of these trends is shown in Fig. 4(d), by
averaging over many SNQ instances (for L = 16 and

FIG. 4: Fragmentation of the solution space as the SAT-UNSAT transition is approached. (a, b, c) The neutral network of
satisfying solutions {s_N} for one particular problem instance (L = 16, R = 6), as a function of the number of crosstalking proteins
N. Satisfying sequences (nodes) are connected by edges (lines) in a network if they are separated by Hamming distance 1.
(The spatial layout of nodes has no meaning; all sequences are vertices on an L-dimensional hypercube.) (a) N = 4: there
are 5786 satisfying solutions in one large connected component. This cluster is broken up into multiple pieces as N increases.
(b) N = 12: 1226 sequences are distributed among 18 connected components. (c) N = 20: only 85 sequences remain viable,
scattered across 38 disjoint components. (d) For L = 16, R = 6, average values of the size of the largest connected sequence
cluster (solid line) and the number of disjoint clusters (dashed line) as a function of N, averaged over 100 SNQ instances for
each value of N.



R = 6). This reveals that the size (i.e., the number of 
nodes) of the largest cluster (solid line) decreases roughly 
exponentially with crosstalk number N. We can under- 
stand this decrease in part by considering the geometric 
argument summarized in Fig. 3(a); see the supplementary
text for details. Also shown in Fig. 4(d) is the number
of disjoint clusters (dashed line); this is seen to initially
increase with N - as the single satisfying solution
cluster is fragmented - and then decrease - as small se- 
quence clusters evaporate in the presence of new crosstalk 
proteins. Fig. 4 reveals a number of isolated clusters of
size 1, but these problem sizes are rather small (given the 
computational burdens of exhaustive enumeration) . It is 
an open question whether nontrivial cluster size distri- 
butions will reveal themselves as larger problem sizes are 
considered. 



3. DISCUSSION 

The goal of this paper has been to examine the lim- 
itations of crosstalk-free signaling in a simple model of 
competitive protein-protein interactions, as a first step 
toward developing a more comprehensive and realistic 
theory. The numerical experiments presented were mo- 
tivated by phase transitions observed in the random 
k-SAT problem [24, 25, 26, 27], where there is a SAT-UNSAT
transition as the ratio of constraints to variables
is increased. The numerical results presented for the SNQ 
demonstrate something similar: a relatively sharp transi- 
tion from satisfiability to unsatisfiability with increasing 
competition for sequence space, along with an increase 
in computational complexity near the transition. Phase 
transitions have been studied in a number of NP-hard 
problems, although applications to biological problems
have been scant and generally at coarser levels of biological
description [28, 29], despite significant interest
in the computational complexity of problems involving 
sequence matching and discrimination [22, 23]. A second
phase transition has more recently been identified
in k-SAT, lurking near the SAT-UNSAT phase boundary,
involving the fragmentation of the set of satisfying
solutions [30, 31, 32]. We find evidence for such a
fragmentation transition in small instances of the SNQ, 
although further theoretical and computational work is 
needed to fully characterize these transitions, which are 
only strictly defined in the limit of infinite system size. 

Of particular interest are the possible biological im- 
plications of these results. Where, for example, are sig- 
naling systems in nature situated with respect to these 
types of phase boundaries, and what sorts of codes has 
evolution uncovered in such systems? Have evolutionary 
innovations - such as novel interaction domains [16] or
scaffolds that localize signaling proteins [33] - arisen to
rescue cellular networks from the precipice of crosstalk? 
Signaling interactions do not occur in isolation, but of- 
ten involve compartmentalization or localization (e.g., on 
scaffolds) that confer context-dependent specificity in ad- 
dition to the intrinsic sequence specificity addressed here 
[34, 35, 36]. In addition, fragmentation of the network of
satisfying solutions of the sort demonstrated here leads to 
complex neutral network topologies. The extent to which 
neutral network topology influences evolution remains an 
open question. Neutral network fragmentation
could lead to biological systems becoming frozen in local 
regions of sequence space, unable to mutate to other sat- 
isfactory configurations far away. This could produce a 
sort of speciation at the molecular scale, perhaps shed- 
ding light on phylogenetic relationships among related 
protein interaction domains. Larger-scale genomic rearrangements,
such as homologous recombination and horizontal
transfer, may play a role in helping biological communication
systems become unstuck from a glassy, fragmented
phase where single point mutations are unable to
do so. 

Examination of the SAT-UNSAT transition in random 
instances of the SNQ led to the derivation of a random 
satisfiability bound (eq. (S8)). This represents an upper
limit to the number of randomly distributed sequences 
that can coexist without crosstalk becoming likely. While 
the bound was motivated by the SAT-UNSAT transition, 
it is also usefully interpreted within the context of coding 
theory bounds on the size of codes. Whereas a sphere-packing
bound describes the number of Hamming
spheres of radius R that can be packed in L dimensions
with no overlap, and a smooth coding bound [16] allows
for some overlap of sequence recognition spheres, our sat- 
isfiability bound is applicable to a dense, overpacked limit 
when all capacity for uniquely distinguishing signals disappears.
The bound presented in eq. (S8) is explicitly applicable
to binary sequences without reverse-complement
symmetry. It is straightforwardly generalizable (see supplementary
text), within the assumption that binding is
entirely dictated by the Hamming distance between two 
sequences, to sequences with larger alphabets (e.g., 20 
amino acids) or to sequences with reverse-complement 
symmetry (e.g., as has been done for other code bounds 
treating DNA sequences [16, 39]).

Protein sequences and sequence niches involved in cel- 
lular signaling have, of course, been sculpted by evolu- 
tion. We might expect evolution to be able to find better 
encoding schemes than the random placement considered 
here, by arranging sequence recognition volumes to max- 
imize fitness. Addressing this question, however, requires 
consideration of several factors. First, it is not obvious 
what fitness measure is optimized by natural selection. If 
discrimination among different sequences were the only 
determinant of fitness, we might expect encodings to 
more closely resemble sphere packings, with recognition 
volumes maximally distinct from one another. Other de- 
terminants could alter such packings, however; a fitness 
advantage from some weak crosstalk, perhaps as a form 
of degeneracy or functional redundancy [40], might keep
recognition volumes from diverging too far from one an- 
other. And it must be remembered that evolutionary 
mutation plays a central role in posing these constraint 
satisfaction problems in the first place, in that gene dupli- 
cation leads to the creation of homologous proteins that 
recognize similar substrates. The random limit consid- 
ered here, while useful for analysis, is not directly relevant 
to the biology of duplicated proteins that may diverge 
from one another just far enough to be distinguishable.

An examination of experimental and genomic data for 
model systems is an obvious next step, both to probe 
the structure and evolution of sequence niches in nature, 
and to develop more realistic and predictive models of 
protein-protein interaction. The experimental work reported
in ref. [8] included a series of single-base-pair missense
mutations to the native yeast Pbs2 motif, to probe
the sequence space around that motif. All such mutations 
resulted in an increased cross-reactivity with other yeast 
SH3 domains, suggesting that the Pbs2 ligand lies near 
the periphery of a possible sparse and tenuous sequence 
niche, but further examination of yeast SH3 interaction 
data is needed to better characterize that. Fortunately, 
there has been considerable experimental work in screen- 
ing synthetic peptide ligands to map out the sequence 
recognition volumes of SH3 domains in several proteins 
[41], and similar sorts of data are becoming available for
systems such as two-component regulators [19]. Computational
classifiers (e.g., weight matrices and neural networks)
trained on protein-protein interaction data have
been used to make predictions about binding affinities of
particular proteins to arbitrary peptides [42, 43]. With a
combination of experimental data, computational mod- 
els, and a theoretical understanding of the complexities 
of constraint satisfaction problems, we can aim to map 
out the structure of high-dimensional sequence niches un- 
derlying cellular decision making in biological systems. 

SUPPLEMENTARY MATERIAL

S.1. Derivation of critical number of crosstalking
proteins (random satisfiability bound) 

Here we derive the result stated in eq. (1) of the main 
text, the critical number of crosstalking proteins N c for a 
given sequence length L and promiscuity radius R, which 
we can interpret as a random satisfiability bound for the 
size of the protein-protein interaction code. A given in- 
stance of the SNQ is unsatisfiable if the target volume 
(i.e., the Hamming sphere of radius R surrounding the 
target sequence T) is completely covered by the union of 
the crosstalk volumes (centered about the crosstalk se- 
quences {C}), a process that is illustrated schematically 
in the main text in Fig. 3(a). We can estimate the critical
number of crosstalk proteins N_c needed to cover the
sequence volume of the target protein. For a given binary
string of length L, the number of sequences V(L, R) in a
ball of Hamming radius R is

$$ V(L, R) = \sum_{n=0}^{R} \binom{L}{n} \tag{S1} $$

and the total possible number of sequences $V_0(L)$ is

$$ V_0(L) = 2^L \tag{S2} $$

Let q be the ratio of these sequence volumes:

$$ q = V/V_0 \tag{S3} $$

We consider depositing at random sequence volumes of 
size V(L, R) in a space of volume $V_0(L)$. From the binomial
distribution, the probability that a given point in
sequence space is covered n times after N proteins have
been deposited is

$$ P_q(n|N) = \binom{N}{n} q^n (1-q)^{N-n} \tag{S4} $$


Therefore the probability $U_q(N)$ that a given point in
sequence space is left uncovered by N proteins is

$$ U_q(N) = P_q(0|N) = (1-q)^N \tag{S5} $$


We can thus estimate the average number of sequences
$S_u(V, q, N)$ in the target volume V left uncovered by N
proteins to be

$$ S_u(V, q, N) = V (1-q)^N \tag{S6} $$
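The estimate in eq. (S6) can be checked numerically for small L. The sketch below is an illustration under stated assumptions (uniformly random target and crosstalk sequences, integer-bitmask encoding, hypothetical function names, arbitrary trial count and seed): it averages the number of target-ball sequences left uncovered over many random instances and compares the result with V(1-q)^N.

```python
import random
from math import comb


def mean_uncovered(L, R, N, trials=400, seed=1):
    """Monte Carlo average of the number of target-ball sequences left uncovered."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        T = rng.getrandbits(L)
        crosstalk = [rng.getrandbits(L) for _ in range(N)]
        for s in range(2 ** L):
            in_target = bin(s ^ T).count("1") <= R
            if in_target and all(bin(s ^ c).count("1") > R for c in crosstalk):
                total += 1
    return total / trials


def predicted_uncovered(L, R, N):
    """S_u = V (1 - q)**N from eq. (S6), with q = V / 2**L."""
    V = sum(comb(L, n) for n in range(R + 1))
    return V * (1 - V / 2 ** L) ** N


if __name__ == "__main__":
    print(mean_uncovered(8, 2, 5), predicted_uncovered(8, 2, 5))
```

With independent, uniformly distributed crosstalk sequences, each point of the target ball is missed by a given crosstalk ball with probability 1 - q, so the simulated mean should agree with eq. (S6) up to sampling noise.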



We wish to estimate the critical number of proteins N_c
required to cover the target volume; since the sequence
space is discrete, we estimate N_c as the number of proteins
for which there is O(1) remaining uncovered sequence
in the target volume. This yields

$$ V (1-q)^{N_c} = 1 \tag{S7} $$

which implies

$$ N_c = \frac{\log(1/V)}{\log(1 - V/V_0)} \tag{S8} $$



The estimate (S8) appears to adequately describe the
SNQ simulation data presented in the main text, as indicated
by the scaling collapses shown in Fig. 3 of the
main text. We expect the quality of the estimate to degrade,
however, as the discrete nature of the sequence
space becomes more important, i.e., as the number of
sequences in the target volume V(L, R) becomes small
(of O(1)). Indeed, for the situation R = 0, where there
is only one sequence in the target volume to be covered
(namely the target sequence T), the estimate (S8) yields
N_c = 0. For this case, however, we can independently
estimate the number of randomly situated crosstalking
sequences required to ensure that the target sequence T
is covered with probability 1/2:

$$ 1 - (1-q)^{N_c} = 1/2 \tag{S9} $$

implies

$$ N_c^{R=0} = \frac{\log(1/2)}{\log(1-q)} \tag{S10} $$

$$ N_c^{R=0} = \frac{\log(1/2)}{\log(1 - 1/V_0)} \tag{S11} $$


The result (S8) assumes an alphabet size q = 2 (i.e.,
binary sequences; here q denotes the alphabet size rather
than the coverage ratio of eq. (S3)). We can generalize the satisfiability
bound in a straightforward manner, if we assume that
binding of two sequences continues to be dictated by a
maximal Hamming distance, i.e., two sequences s_1 and
s_2 will bind if H(s_1, s_2) ≤ R. In this case, the form of
the bound (S8) remains unchanged, and we need simply
redefine the relevant sequence volumes corresponding to
an alphabet of size q:

$$ V(L, R) = V(L, R, q) = \sum_{n=0}^{R} \binom{L}{n} (q-1)^n \tag{S12} $$

$$ V_0(L) = V_0(L, q) = q^L \tag{S13} $$



In the case of reverse complement symmetric (RCS) 
sequences (e.g., for binding of protein to DNA in the 
regulation of gene transcription), the bound is reduced 
because each sequence in the target volume can be cov- 
ered either by a ball centered within Hamming distance 
R of the sequence, or by a ball centered within distance 
R of the reverse complement of that sequence. This has 
the effect of doubling the coverage ratio q: q = 2V/V~o- 
As a result, 



$$N_c^{RCS} = \frac{\log(1/V)}{\log(1 - 2V/V_0)} \qquad \text{(S14)}$$

which is valid for R < L/2. For R > L/2, N_c^{RCS} = 1.
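As an illustration of the generalized volumes and the RCS bound, the sketch below evaluates both for an arbitrary alphabet size. It assumes the bound keeps the form log(1/V) / log(1 - cV/V_0), with c = 1 in general and c = 2 for RCS sequences; the function names and the guard on the coverage ratio are illustrative choices, not part of the original work.

```python
from math import comb, log, log1p

def ball_volume_q(L, R, q):
    """Eq. (S12): sequences within Hamming distance R, alphabet of size q."""
    return sum(comb(L, n) * (q - 1) ** n for n in range(R + 1))

def n_c_bound(L, R, q=2, rcs=False):
    """Satisfiability bound for alphabet size q; doubles coverage if rcs is True."""
    V = ball_volume_q(L, R, q)
    V0 = q ** L                      # eq. (S13)
    coverage = (2 if rcs else 1) * V / V0
    if coverage >= 1.0:
        # The logarithm is undefined here; the estimate does not apply.
        raise ValueError("coverage ratio must be below 1")
    return log(1.0 / V) / log1p(-coverage)

if __name__ == "__main__":
    # Hypothetical example: DNA-like alphabet (q = 4), L = 8, R = 3.
    print(n_c_bound(8, 3, q=4))
    print(n_c_bound(8, 3, q=4, rcs=True))
```

Doubling the coverage ratio makes the denominator more negative, so the RCS value is always the smaller of the two for the same L, R, and q.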

The main text alludes to a symmetric generalization
of the SNQ that asks whether every protein in a
collection is distinguishable, that is, whether there
is a separate sequence niche for each of N proteins.
While we do not have a general estimate for the critical
number of proteins N_c for this problem, we can produce
such an estimate for the special case of R = 0, where
crosstalk occurs only if two sequences are exactly the
same (no mismatches). In that limit, the question
boils down to this: For binary sequences of length
L, how many randomly chosen sequences must be
chosen for there to be a probability of at least 1/2 that
two sequences are identical? This is just the classic
"birthday problem" of probability theory, for a system
where a "year" contains V_0 = 2^L possible days (see, e.g.,
http://en.wikipedia.org/wiki/Birthday_problem).
The probability p(n) that two sequences out of n will
match is:



$$p(n) = 1 - \frac{V_0!}{(V_0 - n)!\, V_0^{\,n}} \qquad \text{(S15)}$$



so, for a given sequence length L, we can find the number
N_c for which this probability exceeds 1/2 to arrive at an
estimate for the R = 0 bound of the generalized SNQ.
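A minimal sketch of this calculation follows. It evaluates the birthday probability as a running product, which equals 1 - V_0!/((V_0 - n)! V_0^n) but avoids large factorials; the function names are illustrative.

```python
def p_match(n, V0):
    """Probability that at least two of n uniformly random sequences coincide."""
    p_distinct = 1.0
    for k in range(n):
        p_distinct *= 1.0 - k / V0
    return 1.0 - p_distinct

def birthday_threshold(V0, target=0.5):
    """Smallest n for which p_match(n, V0) reaches the target probability."""
    p_distinct = 1.0
    n = 0
    while 1.0 - p_distinct < target:
        p_distinct *= 1.0 - n / V0  # the next sequence must avoid the n already drawn
        n += 1
    return n

if __name__ == "__main__":
    print(birthday_threshold(365))      # classic birthday problem
    print(birthday_threshold(2 ** 16))  # binary sequences of length L = 16
```

The threshold grows roughly as the square root of V_0, so for R = 0 the generalized problem is limited by far fewer sequences than the original SNQ estimate for R = 0, which grows linearly in V_0.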

In this light, the R = 0 case for the original SNQ (eq.
(S11)) can be seen as a variant of the "my birthday
problem", which asks for the probability that someone in a
group of N people will share my birthday. The
probability of any crosstalk sequence matching the target
sequence (in the original SNQ) is of course smaller than the
probability that any two crosstalk sequences will match
each other (in the generalized SNQ). For R > 0,
estimating the bound would seem to be a variant of the
near-match birthday problem [44], but in higher dimensions.

S.2. Size of the largest solution cluster 

Fig. 4(d) of the main text demonstrates that the size
S_0 of the largest cluster (solid line) decreases roughly
exponentially with crosstalk number N. From the
geometric argument illustrated in Fig. 3(a) in the main text,
we might expect

$$S_0 \sim (1 - q)^N \sim \exp(-qN) \quad \text{for small } q \qquad \text{(S16)}$$

where q = V(L,R)/V_0(L) as in eq. (S3). For L = 16, R =
6, q ≈ 0.23, and a fit to the cluster size data in Fig. 4(d)
reveals S_0 ~ exp(-0.29N). The exponential
approximation to the power law in eq. (S16) would be more
accurate for smaller q, but part of the discrepancy
between the predicted and measured decay rate is due to
the fact that the geometric argument only describes the
elimination of viable sequences by crosstalk proteins, and
not the fragmentation of clusters. Some of the decrease
in S_0 is due to the latter effect.
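The size of the discrepancy can be checked with a few lines of arithmetic. The sketch below assumes binary sequences and compares the small-q rate q, the exact rate -log(1 - q) implied by (1 - q)^N, and the fitted value 0.29 quoted above; the function name is illustrative.

```python
from math import comb, log

def decay_rates(L, R):
    """Return (q, exact_rate), where (1 - q)^N = exp(-exact_rate * N)."""
    V = sum(comb(L, n) for n in range(R + 1))  # sequences within distance R
    q = V / 2 ** L                             # coverage ratio V / V_0
    return q, -log(1.0 - q)

if __name__ == "__main__":
    q, exact_rate = decay_rates(16, 6)
    fitted_rate = 0.29  # value quoted in the text for Fig. 4(d)
    print(f"q = {q:.4f}, -log(1-q) = {exact_rate:.4f}, fit = {fitted_rate}")
```

Using the exact rate instead of the small-q approximation closes part of the gap to the fitted value, and the remainder is what the text attributes to cluster fragmentation.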

S.3. Review of results from Zarrinpar, Park and 
Lim 

We describe here in slightly more detail the
experimental results of ref. [1]. Zarrinpar et al. investigated
SH3-mediated signaling in yeast (Saccharomyces cerevisiae),
probing in particular the signaling pathway involved in a
high-osmolarity response, predicated on the interaction
of the Sho1 protein (containing an SH3 domain) and the
Pbs2 protein (with an exposed proline-rich, PXXP,
peptide sequence). Experimentally, they created chimeric
versions of the Sho1 protein, replacing the native SH3
domain with each of the other 26 SH3 domains found in
yeast. (Three of the Sho1 chimeras were insoluble,
however, so they could not be assayed in vivo.) They then
sought to determine whether any of those domains could
reconstitute the function of the high-osmolarity pathway,
and found that none of the other yeast domains could do
so. In vitro peptide binding assays revealed a similar
lack of interaction from any but the Sho1-Pbs2 pair.
When SH3 domains from 12 metazoan proteins were
tested (both in vivo and in vitro), however, it was
discovered that 6 of those were able to reconstitute the
function of the high-osmolarity pathway. Their
interpretation was that there has been an evolutionary
selection against crosstalk in yeast, whereby domains and
peptides have evolved such that the Pbs2 PXXP motif
lies in a niche in sequence space where it is recognized
by only the Sho1 SH3 domain, as is illustrated
schematically in Fig. S.1(a). Since there has been no such
selection pressure in other organisms, it was perhaps not
surprising that the Pbs2 motif overlaps with the
recognition volumes of many non-yeast SH3 proteins, as is
illustrated in Fig. S.1(b).

Zarrinpar et al. also sought to characterize the nature
of protein-protein interactions in the sequence space
surrounding the wild-type Pbs2 motif, which they did by
assaying a library of 19 single-base-pair missense mutations
to the native yeast Pbs2 motif (leaving the core prolines
of the PXXP motif unchanged). While some mutations
resulted in increased affinity for Sho1, and some resulted in
decreased affinity, all mutations resulted in an increased
cross-reactivity with other yeast SH3 domains. This
suggests that the wild-type Pbs2 is optimized not for affinity,
but for discrimination among different SH3 domains.



S.4. Methods 

To ascertain whether a given instance of the SNQ
was satisfiable or not, I implemented the algorithm by
Gramm et al. [23] ("Algorithm D" in [23], modified
as described to treat the Distinguishing String Selection
Problem). This is a recursive, backtracking algorithm
in the style of Davis-Putnam (DP)-type methods used in
the study of other NP-complete problems (e.g., k-SAT
[11]). Algorithm D in [23] implements heuristics to prune
the search tree, tailored to the Distinguishing String
Selection Problem (DSSP). DP-type algorithms are known
to be significantly slower in practice for k-SAT than
other algorithms (e.g., WalkSAT [46] or survey
propagation [30]), but have the advantage of being complete,
i.e., able to determine whether any instance is
satisfiable or not, given sufficient computer time. (Incomplete
algorithms can typically find a solution if there is one,
but are not guaranteed to stop if there is no solution.)
For forays into a newly-identified NP-complete problem
such as this, complete algorithms are a useful first step.
For each SNQ instance, it was determined whether the
instance was satisfiable, and how long it took to decide
that question. Since DP-type methods are recursive, it
is conventional to measure algorithm run times in units
of number of calls to the recursive core, which is what we
have done here.
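To make the notion of a complete, recursive search with a call counter concrete, the following is a minimal sketch. It is not Algorithm D of [23]: it is a plain depth-first search with a single pruning rule, and it assumes that an SNQ instance is satisfiable when some sequence lies within Hamming distance R of the target T and farther than R from every crosstalk sequence. All names are illustrative.

```python
def snq_satisfiable(T, Cs, R):
    """Complete backtracking search for an SNQ instance (illustrative only).

    T and each element of Cs are tuples of 0/1 of equal length.
    Returns (satisfiable, witness_or_None, number_of_recursive_calls).
    """
    L = len(T)
    calls = 0

    def search(pos, flips_left, prefix, dists):
        nonlocal calls
        calls += 1
        remaining = L - pos
        # Prune: some crosstalk sequence can no longer end up farther than R.
        if any(d + remaining <= R for d in dists):
            return None
        if pos == L:
            return prefix
        options = [T[pos]]              # keep the target bit
        if flips_left > 0:
            options.append(1 - T[pos])  # or spend one mismatch against T
        for used, bit in enumerate(options):
            new_dists = [d + (bit != C[pos]) for d, C in zip(dists, Cs)]
            found = search(pos + 1, flips_left - used, prefix + (bit,), new_dists)
            if found is not None:
                return found
        return None

    witness = search(0, R, (), [0] * len(Cs))
    return witness is not None, witness, calls

if __name__ == "__main__":
    T = (0, 0, 0, 0, 0, 0)
    Cs = [(0, 0, 0, 0, 0, 1), (1, 0, 0, 0, 0, 0)]
    print(snq_satisfiable(T, Cs, R=1))
```

The search is complete because it either returns a witness or exhausts every sequence within distance R of T; the call counter plays the role of the run-time measure described above.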

The SNQ, as stated, applies to any set of sequences T
and {C}. This paper has focused on random instances
of the SNQ, where the relevant sequences are sampled
uniformly at random from the set of all binary sequences
of length L, with equal probabilities of 0 and 1 in the
sequences T and {C}. Simulations of random instances
of the SNQ were carried out, for various values of the
relevant control parameters: the string length L, the
Hamming radius R, and the number of crosstalk proteins N.
Average satisfiability and algorithmic run time were
computed from 100 random SNQ instances for each set of L,
R, and N.
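The sampling procedure can be sketched as follows. This is an illustration under stated assumptions, not the code used for the paper: sequences are stored as L-bit integers, satisfiability is decided by brute-force enumeration of the ball around T (practical only for small L and R), and an instance counts as satisfiable when some sequence within distance R of T is farther than R from every crosstalk sequence. The function names and the seed parameter are hypothetical.

```python
import random
from itertools import combinations

def hamming(a, b):
    """Hamming distance between two sequences stored as integers."""
    return bin(a ^ b).count("1")

def is_satisfiable(T, Cs, L, R):
    """Brute-force check over all sequences within distance R of T."""
    for k in range(R + 1):
        for positions in combinations(range(L), k):
            s = T
            for p in positions:
                s ^= 1 << p  # flip bit p of the target
            if all(hamming(s, C) > R for C in Cs):
                return True
    return False

def satisfiable_fraction(L, R, N, trials=100, seed=0):
    """Fraction of random instances (uniform T and N crosstalk sequences) that are satisfiable."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        T = rng.getrandbits(L)
        Cs = [rng.getrandbits(L) for _ in range(N)]
        hits += is_satisfiable(T, Cs, L, R)
    return hits / trials

if __name__ == "__main__":
    for N in (5, 40, 80, 160):
        print(N, satisfiable_fraction(10, 2, N))
```

Sweeping N at fixed L and R in this way traces out the drop in average satisfiability from one toward zero.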

[Figure S.1: schematic of Sho1 SH3 recognition profiles, panels (a) and (b).]

FIG. S.1: The interpretation offered by Zarrinpar, Park and Lim to describe (a) the lack of crosstalk among S. cerevisiae
SH3 domains and (b) the presence of crosstalk among non-S. cerevisiae SH3 domains. [Adapted from [8].] (a) In S. cerevisiae,
evolutionary selection against crosstalk has driven the proline-rich Pbs2 motif to a niche where it is recognized only by the
Sho1 SH3 domain. (b) There is no such selection pressure in other organisms, so domains introduced from elsewhere can bind
Pbs2.

To explore the full solution space of SNQ instances,
exhaustive examination was carried out. For each of the
2^L possible sequences, it was determined whether that
sequence satisfied the given SNQ. The set of valid solutions
was assembled to form an undirected graph, whose nodes
were SNQ solutions and whose edges joined nodes with
sequences that differed by Hamming distance of 1, i.e.,
by 1 bit flip. The network analysis package NetworkX
[networkx.lanl.gov] was used to compute connected
components of the resulting graphs, and to generate layouts
for visual display. This work motivated a contribution
on my part to the NetworkX source code repository
[networkx.lanl.gov/changeset/223], using tuples of index
coordinates to label grid graphs, such as would be used to
represent an L-dimensional hypercube. This representation
is natural for graphs connecting nodes in sequence
space. A spring force layout algorithm was used to
generate the images in Figs. 4(a)-(c) in the main text, whereby
connected nodes are attracted to each other to produce
compact representations of connected components. As
noted, however, the positions of the graph nodes in Figs.
4(a)-(c) have no intrinsic meaning, as all nodes are
vertices on the L-dimensional hypercube. The problem of
usefully visualizing complex network structures in
high-dimensional sequence spaces is an ongoing challenge in
computational biology.
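The construction of the solution graph and its connected components can be sketched without external dependencies. The example below is a stand-in for the NetworkX-based analysis, not a reproduction of it: it stores sequences as L-bit integers, assumes a solution is any sequence within distance R of T and farther than R from every crosstalk sequence, and finds components by breadth-first search over single bit flips. All names are illustrative.

```python
from collections import deque

def hamming(a, b):
    """Hamming distance between two sequences stored as integers."""
    return bin(a ^ b).count("1")

def solution_set(T, Cs, L, R):
    """Exhaustively collect all of the 2^L sequences that satisfy the instance."""
    return {s for s in range(2 ** L)
            if hamming(s, T) <= R and all(hamming(s, C) > R for C in Cs)}

def connected_components(solutions, L):
    """Components of the graph whose edges join solutions one bit flip apart."""
    seen = set()
    components = []
    for start in solutions:
        if start in seen:
            continue
        seen.add(start)
        component = {start}
        queue = deque([start])
        while queue:
            s = queue.popleft()
            for p in range(L):
                neighbor = s ^ (1 << p)  # hypercube neighbor of s
                if neighbor in solutions and neighbor not in seen:
                    seen.add(neighbor)
                    component.add(neighbor)
                    queue.append(neighbor)
        components.append(component)
    return sorted(components, key=len, reverse=True)

if __name__ == "__main__":
    L, R, T = 4, 1, 0b0000
    for Cs in ([], [0b0001]):
        sizes = [len(c) for c in connected_components(solution_set(T, Cs, L, R), L)]
        print(Cs, sizes)
```

In the small example, a single crosstalk sequence removes the center of the ball around T and leaves the remaining solutions as isolated nodes, a toy version of the fragmentation discussed in the main text.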